Probabilistic Model for Segmentation Based Word Recognition with Lexicon

Authors

  • Sergey Tulyakov
  • Venu Govindaraju
Abstract

The problem of off-line reading of unconstrained handwritten words has been studied extensively because of its role in many important applications, such as reading addresses on mail pieces [3, 6, 11], reading amounts on bank checks [7, 10], extracting census data from forms [2, 9], and reading address blocks on tax forms [12]. The main challenges are the wide variety of writing styles, poor image quality, and missing or extraneous strokes caused by segmentation errors. The intuitive solution is to segment the word image into probable character sub-images, recognize the characters separately, and combine the results [4, 8]. The role of the optical character recognizer (OCR) is to provide confidence scores for the supposed character images. Although many different OCRs are available, they focus mainly on classifying isolated character images. In practice, with unconstrained handwritten word images there is no guarantee that the segmented sub-images will be single isolated characters, so an OCR used for word recognition should be able to assign low confidence scores to non-character images.

Besides choosing the right OCR for word recognition, it is also important to know how to combine the OCR confidence scores for individual characters into an overall confidence score for the entire lexicon word: should we take the arithmetic mean, the geometric mean, or some other normalizing formula? This question bears on what an OCR score truly means. For example, given an image and a hypothesis character, should the OCR produce a score representing the posterior probability or the prior probability?

In this paper we describe the construction of a possible mathematical model for word recognizers that are based on the segmentation paradigm and the use of a lexicon. The construction of the model is motivated by a comparison of two word recognizers existing at CEDAR: CMR (Character Model Recognizer) [4] and WMR (Word Model Recognizer) [8]. These recognizers use similar preprocessing and segmentation techniques. Using a seemingly inferior character recognizer, WMR is able to perform better than CMR on word images. CMR uses the GSC (Gradient, Structural, Concavity) [5] character recognizer, which is widely accepted as being very accurate.
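
The question of how per-character scores should be combined can be made concrete with a small sketch. The code below is an illustration only and is not the model proposed in the paper: the function names and the score values are hypothetical, and the scores are simply assumed to lie in [0, 1]. It compares the two normalizing formulas named in the abstract, the arithmetic mean and the geometric mean.

```python
import math


def arithmetic_mean(scores):
    """Average of the per-character confidence scores."""
    return sum(scores) / len(scores)


def geometric_mean(scores):
    """Geometric mean of the scores, computed in log space.

    Any non-positive score makes the word score 0.0.
    """
    if any(s <= 0 for s in scores):
        return 0.0
    return math.exp(sum(math.log(s) for s in scores) / len(scores))


# Hypothetical OCR scores for the three segments of one lexicon word.
# The last segment is a poor match, e.g. a non-character sub-image.
one_bad_segment = [0.95, 0.95, 0.05]

print(arithmetic_mean(one_bad_segment))
print(geometric_mean(one_bad_segment))
```

With these made-up numbers the geometric mean is pulled down much more strongly by the single poor segment than the arithmetic mean is. Which behavior is appropriate depends on what the OCR score is taken to represent, which is the question the abstract raises.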

Similar resources

A probabilistic method for keyword retrieval in handwritten document images

Keyword retrieval in handwritten document images (word spotting) is very challenging given that OCR accuracy is not yet adequate for handwritten scripts, especially with large lexicons. Various proposed approaches build indices on information such as image features or OCR scores and have improved on the performance of the traditional approach, which builds an index on OCR'ed text. In this paper, we impr...

A two-stage method for handwritten Farsi word recognition using adaptive blocking of the image gradient

This paper presents a two-step method for offline handwritten Farsi word recognition. In the first step, to improve recognition accuracy and speed, an algorithm is proposed for initially eliminating lexicon entries unlikely to match the input image. For lexicon reduction, the words of the lexicon are clustered using the ISOCLUS and hierarchical clustering algorithms. Clustering is based on the featur...

A hybrid large vocabulary handwritten word recognition system using neural networks with hidden Markov models

In this paper we present a hybrid recognition system that integrates hidden Markov models (HMM) with neural networks (NN) in a probabilistic framework. The input data is processed first by a lexicon-driven word recognizer based on HMMs to generate a list of the candidate N-best-scoring word hypotheses as well as the segmentation of such word hypotheses into characters. An NN classifier is used ...

Offline cursive handwriting recognition system based on hybrid Markov model and neural networks

An offline cursive handwriting recognition system, based on a hybrid of Neural Networks (NN) and Hidden Markov Models (HMM), is described in this paper. Applying the SegRec principle, the recognizer does not make a hard decision during the character segmentation process. Instead, it delays the character segmentation to the recognition stage by generating a segmentation graph that describes all possible way...

On-line Handwritten Uyghur Word Recognition Using Segmentation-Based Techniques

An approach to online handwritten word recognition using segmentation-based techniques is presented in this paper. This approach is referred to as a lexicon-driven approach because an optimal segmentation is generated for each string in the lexicon. The word recognition problem is transformed into matching optimization problems between the dictionary entry and the handwritten word image. The segmenta...

Publication date: 2001